Abstract

In [2] we gave a metric on the class of semibinary tree-sibling time
consistent phylogenetic networks that is computable in polynomial time;
in particular, the problem of deciding if two networks of this kind are
isomorphic is in P. In this paper, we show that if we remove the
semibinarity condition above, then the problem becomes much harder. More
precisely, we prove that the isomorphism problem for generic tree-sibling
time consistent phylogenetic networks is polynomially equivalent to the
graph isomorphism problem. Since the latter is believed to be neither in P
nor NP-complete, the chances are that it is impossible to define a metric
on the class of all tree-sibling time consistent phylogenetic networks that
can be computed in polynomial time.

1 Introduction

After the realization that reticulation processes, like hybridizations, recombi- 
nations or lateral gene transfers, have been more relevant in the evolution of 
life on Earth than previously thought [B], there has been a growing interest in 
the development of algorithms for the reconstruction of phylogenetic networks: 
graphical models of evolutionary histories that go beyond phylogenetic trees by 
including hybrid nodes of in-degree greater than one representing reticulation 
events. As the number of such algorithms increases, so does the need for
methods for the comparison of phylogenetic networks, since such methods are
used, for instance, to assess the reliability and robustness of these
algorithms [T21 PH].

One of the types of phylogenetic networks for which there exist
reconstruction methods [9, 10] are the tree-sibling time consistent
networks, TSTC networks for short (see Section 2 for a formal definition).
There have been several attempts to define a metric on the class of all
TSTC networks on a given set of taxa [13], and we have recently given a
metric on the class of all semibinary TSTC networks, where all hybrid nodes
have in-degree two [5], but none of the metrics for phylogenetic networks
computable in polynomial time proposed so far satisfies the separation
axiom (distance 0 means isomorphism) for generic TSTC networks: see
[31 H]. In this paper we show why this should come as no surprise: such a
metric would solve the graph isomorphism problem in polynomial time.

The graph isomorphism problem is one of the most important decision
problems whose computational complexity is not yet known [11]. It is
believed to be neither in P nor NP-complete, and subexponential time
solutions for it are known. A problem is said to be graph
isomorphism-complete when it is polynomially equivalent to the graph
isomorphism problem. In this paper we show that, for every set S with more
than two elements, the isomorphism problem for TSTC phylogenetic networks
with taxa bijectively labeled in S is graph isomorphism-complete.

2 Preliminaries 

Let G = (V, E) be a non-empty rooted directed acyclic graph (a rDAG, for
short). A node of G is a leaf if it has out-degree 0, internal if its out-degree is
≥ 1, of tree type if its in-degree is ≤ 1, of hybrid type if its in-degree is > 1, and
elementary if it is a tree node of out-degree 1. A node v is a child of another
node u (and, hence, u is a parent of v) if (u, v) ∈ E. Two nodes u and v are
siblings of each other if they share a parent. An arc (u, v) in a rDAG is a tree 
arc when v is a tree node, and a hybridization arc when v is a hybrid node. The 
height of a node v is the longest length of a directed path from v to a leaf, and 
the depth of v is the longest length of a directed path from the root to v. 

Given a finite set S of labels, a S-rDAG is a rDAG with its leaves injectively 
labelled by S. By an isomorphism of S-rDAGs we understand an isomorphism 
of directed graphs that preserves the labelling, that is, that maps each leaf in 
one network to the leaf with the same label in the other network (in particular, 
isomorphic S-rDAGs must have the same sets of actual leaf labels). In a
S-rDAG, we shall always identify without any further reference every leaf with its
label. 

A phylogenetic network on a set S of taxa is a S-rDAG such that: 

• No tree node is elementary. 

• Every hybrid node has out-degree 1, and its single child is a tree node. 

We will say that a phylogenetic network is tree-sibling if every hybrid node has 
at least one sibling that is a tree node. 
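
The local conditions in these two definitions can be checked mechanically. The following Python sketch is only an illustration under our own assumptions: the network is given as a set of nodes and a collection of arcs (u, v) without repetitions, and the function names are hypothetical. It does not check rootedness, acyclicity or the leaf labelling.

```python
def adjacency(nodes, arcs):
    """Return the (parents, children) maps of a directed graph."""
    parents = {v: set() for v in nodes}
    children = {v: set() for v in nodes}
    for u, v in arcs:
        children[u].add(v)
        parents[v].add(u)
    return parents, children


def is_phylogenetic_network(nodes, arcs):
    """Check the two local conditions: no elementary tree node, and every
    hybrid node has exactly one child, which is a tree node."""
    parents, children = adjacency(nodes, arcs)
    for v in nodes:
        indeg, outdeg = len(parents[v]), len(children[v])
        if indeg <= 1 and outdeg == 1:
            return False                  # elementary tree node
        if indeg > 1:
            if outdeg != 1:
                return False              # hybrid node without a single child
            (c,) = children[v]
            if len(parents[c]) > 1:
                return False              # the child of a hybrid node is hybrid
    return True


def is_tree_sibling(nodes, arcs):
    """Check that every hybrid node has at least one sibling of tree type."""
    parents, children = adjacency(nodes, arcs)
    for h in nodes:
        if len(parents[h]) > 1:
            siblings = {s for p in parents[h] for s in children[p]} - {h}
            if not any(len(parents[s]) <= 1 for s in siblings):
                return False
    return True
```

For instance, a network with root r, tree nodes x and y, and a hybrid node h with parents x and y passes both checks as soon as x and y each have a leaf child and h has a single leaf child.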

A temporal assignment pQ on a network N = (V, E) is a mapping τ : V → ℕ
such that:

(a) If v is a hybrid node and (u, v) ∈ E, then τ(u) = τ(v).

(b) If v is a tree node and (u, v) ∈ E, then τ(u) < τ(v).

We will say that a phylogenetic network is time-consistent if it admits a temporal
assignment. The following alternative characterization of time consistency will
be used later. For a proof, see [TJ[5].

Proposition 1 Let N = (V, E) be a phylogenetic network, let E_H be its set of
hybridization arcs, and let N* = (V, E*) be the directed graph with the same set
V of nodes as N and set of arcs E* = E ∪ {(v, u) | (u, v) ∈ E_H}. Then, N is
time consistent if and only if N* does not have any cycle containing some tree
arc of N.
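
The criterion in Proposition 1 translates directly into a reachability test: a tree arc (u, v) lies on a cycle of N* exactly when u can be reached from v in N*. The Python sketch below illustrates this under our own assumptions (nodes given as a set, arcs as pairs without repetitions, and a hypothetical function name); it makes no attempt at efficiency and runs one graph search per tree arc.

```python
from collections import defaultdict


def is_time_consistent(nodes, arcs):
    """Test time consistency through the cycle criterion of Proposition 1."""
    indeg = defaultdict(int)
    for _, v in arcs:
        indeg[v] += 1
    tree_arcs = [(u, v) for u, v in arcs if indeg[v] <= 1]
    hybrid_arcs = [(u, v) for u, v in arcs if indeg[v] > 1]

    # Successor map of N*: the arcs of N plus the inverse of every
    # hybridization arc.
    succ = {v: set() for v in nodes}
    for u, v in arcs:
        succ[u].add(v)
    for u, v in hybrid_arcs:
        succ[v].add(u)

    def reaches(src, dst):
        """Depth-first search in N* from src, looking for dst."""
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    # N is time consistent iff no tree arc (u, v) can be closed into a cycle.
    return not any(reaches(v, u) for u, v in tree_arcs)
```

As a usage example, a hybrid node h whose two parents a and b are joined by a tree arc (a, b) yields the cycle a, b, h, a in N*, so the function returns False, in agreement with the fact that conditions (a) and (b) cannot both hold there.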

The underlying biological motivation for the definitions on phylogenetic net- 
works introduced so far is the following. In a phylogenetic network, tree nodes 
model species (either extant, the leaves, or non-extant, the internal tree nodes), 
while hybrid nodes model reticulation events, where different species interact to 
create new species, the parents of the hybrid node being the species involved in 
this event and its single child being the resulting species. The tree children of 
a tree node represent direct descendants through mutation. The first condition 
in the definition of phylogenetic network says that every non-extant species is 
assumed to have at least two different direct descendants, be they by mutation
or through some reticulation event. This is a very common restriction in any 
definition of phylogeny (be it a tree or a network), since species with only one 
child cannot be reconstructed from biological data. 

The tree-sibling condition says then that, for every reticulation event, at least 
one of the species involved in it must have some descendant through mutation. 
This condition was introduced under the name class I in L. Nakhleh's PhD
Thesis [13], and it has reappeared in several phylogenetic network reconstruction
methods [9, 10]. As far as the time consistency goes, we understand that the
time assigned to a node represents the time when the corresponding species 
existed, or when the reticulation event took place. The first condition in time 
consistency means then that the species involved in a reticulation event must 
coexist in time in order to interact, while the second condition means that 
speciation takes some amount of time to take place. 

3 Main Results 

It is well known [7, 15] that the isomorphism problem for rDAGs is graph
isomorphism-complete. It turns out that the isomorphism problem for rDAGs 
with their leaves bijectively labeled in any given set of labels is also graph 
isomorphism-complete: since we have not been able to find a proof of this easy 
result in the literature, we provide one here. 

Proposition 2 For every non-empty set S of labels, the isomorphism problem for
S-rDAGs is graph isomorphism-complete.

Proof. Without any loss of generality, we assume that S = {1, ..., n} ⊂ ℕ.

Let us prove first that the isomorphism of S-rDAGs reduces to the isomorphism
of rDAGs. For every S-rDAG G, let G' be the rDAG obtained from G
by unlabelling its leaves and then, for each k = 1, ..., n, if G contained a leaf
labeled with k, adding k new tree leaves as children of this leaf; see Fig. 1. The
construction of G' from G = (V, E) adds O(n²) ≤ O(|V|²) nodes and arcs, and
therefore it is polynomial in the size of G. And G can be reconstructed from G'
by simply replacing, for each k = 1, ..., n, the node of height 1 with k leaves by
a leaf labeled with k. Then, it is straightforward to check that, for every pair
of S-rDAGs G_1 and G_2, G_1 ≅ G_2 as S-rDAGs if, and only if, G'_1 ≅ G'_2
as rDAGs.
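
The passage from G to G' and back can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the graph is a set of nodes plus a list of arcs, `label` maps each leaf to its integer label, and the function names and the names given to the new leaves are hypothetical.

```python
def unlabel_with_pendant_leaves(nodes, arcs, label):
    """G -> G': forget the labels and hang k new leaves from the leaf
    that was labelled k."""
    new_nodes = set(nodes)
    new_arcs = list(arcs)
    for leaf, k in label.items():
        for i in range(k):
            child = ("pendant", leaf, i)   # hypothetical name of a new leaf
            new_nodes.add(child)
            new_arcs.append((leaf, child))
    return new_nodes, new_arcs


def recover_labels(nodes, arcs):
    """G' -> labels of G: every node of height 1 with k leaf children
    corresponds to the leaf of G labelled k."""
    children = {v: [] for v in nodes}
    for u, v in arcs:
        children[u].append(v)
    return {v: len(cs) for v, cs in children.items()
            if cs and all(not children[c] for c in cs)}
```

Since every label is at least 1, each former leaf of G becomes a node of height exactly 1 in G', while the internal nodes of G have height at least 2, which is what makes the recovery step unambiguous.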

Let us prove now that the isomorphism of rDAGs reduces to the isomorphism
of S-rDAGs. For every rDAG G, let G'' be the S-rDAG obtained from G by
adding a new node a, arcs from each leaf of G to a and finally adding one
child leaf to a labeled 1; see Fig. 2. The construction of G'' from G = (V, E)
adds 2 nodes and O(|V|) arcs, and therefore it is polynomial. And G can be
reconstructed from G'' by simply removing its leaf and its only height 1 node,
a, as well as all arcs pointing to a or to the leaf. It is straightforward to
check that, for every pair of rDAGs G_1 and G_2, G_1 ≅ G_2 if, and only if,
G''_1 ≅ G''_2 as S-rDAGs.
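
The construction of G'' is equally short to write down. The Python sketch below is an illustration under our own assumptions (set of nodes, list of arcs, hypothetical function and node names); it returns the new graph together with the labelling of its single leaf.

```python
def add_labelled_sink(nodes, arcs):
    """G -> G'': add a node a below all leaves of G and one leaf labelled 1
    below a."""
    has_child = {u for u, _ in arcs}
    leaves = [v for v in nodes if v not in has_child]
    a, leaf = ("new", "a"), ("new", "leaf")    # hypothetical names
    new_nodes = set(nodes) | {a, leaf}
    new_arcs = list(arcs) + [(v, a) for v in leaves] + [(a, leaf)]
    return new_nodes, new_arcs, {leaf: 1}
```

Undoing the construction amounts to deleting the unique leaf, its parent a, and the arcs pointing to either of them.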




Figure 1: The construction involved in the reduction of the isomorphism of
S-rDAGs to the isomorphism of rDAGs.




Figure 2: The construction involved in the reduction of the isomorphism of
rDAGs to the isomorphism of S-rDAGs.

Theorem 1 For every set S with |S| ≥ 3, the isomorphism problem for TSTC networks
on the set S of taxa is graph isomorphism-complete.

Proof. Without any loss of generality, we assume that S = {1, ..., n} ⊂ ℕ.

The isomorphism of TSTC-networks on S clearly reduces to the isomorphism
of S-rDAGs, since the former are a special case of the latter. Let us prove now
the converse reduction.

Let N = (V, E) be a S-rDAG. Let N̄ be the (S ∪ {n + 1, n + 2})-rDAG
obtained as follows:

(1) For every hybrid node h in N, remove all arcs from h to its children, and
then add a new (tree) node u_h, an arc from h to u_h, and new arcs from u_h
to the children of h in N. If h was a leaf, say with label k, then u_h becomes
the new leaf labeled with k.

(2) For every hybridization arc e = (v, h) in the resulting S-rDAG, split it into
arcs (v, v_e) and (v_e, h), with v_e a new (tree) node.

Let N' denote the resulting S-rDAG after these first two steps.

(3) For every internal tree node v in N', add a new (tree) node v' and an arc
(v, v').

(4) Split the arc (w, n) in N' pointing to the leaf n into two arcs (w, w_n) and
(w_n, n).

(5) Add two new nodes a and b, and, for every node v' added in step (3), add
arcs (v', a) and (v', b). Add also arcs (w_n, a) and (w_n, b). The nodes a and
b will be hybrid.

(6) Add a tree leaf child labelled n + 1 to a, and another one labelled n + 2
to b.

An example of this construction is displayed in Fig. 3.
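
The six steps of this construction can be sketched in Python as follows. This is only an illustrative sketch under our own assumptions: arcs are pairs (u, v) without repetitions, `label` maps leaves to integers in {1, ..., n}, some leaf is labelled n and has a parent, and the function name `tstc_embedding` and the tuples used to name the new nodes are hypothetical. The helper lookups scan all arcs, so no attempt is made to obtain a linear running time.

```python
from itertools import count


def tstc_embedding(nodes, arcs, label, n):
    """Apply steps (1)-(6) and return (nodes, arcs, label) of the result."""
    nodes, arcs, label = set(nodes), set(arcs), dict(label)
    fresh = count()

    def new(kind):
        v = (kind, next(fresh))          # hypothetical name for an added node
        nodes.add(v)
        return v

    def parents(v):
        return [x for x, y in arcs if y == v]

    def children(v):
        return [y for x, y in arcs if x == v]

    hybrids = {v for v in nodes if len(parents(v)) > 1}

    # Step (1): every hybrid node h gets a single tree child u_h that
    # inherits the children (or the label) of h.
    for h in hybrids:
        kids = children(h)
        u_h = new("u")
        for c in kids:
            arcs.remove((h, c))
            arcs.add((u_h, c))
        arcs.add((h, u_h))
        if h in label:
            label[u_h] = label.pop(h)

    # Step (2): split every hybridization arc e = (v, h) with a node v_e.
    for v, h in [e for e in arcs if e[1] in hybrids]:
        v_e = new("v")
        arcs.remove((v, h))
        arcs.update({(v, v_e), (v_e, h)})

    # Step (3): every internal tree node v gets a new tree child v'.
    primes = []
    for v in [v for v in nodes if v not in hybrids and children(v)]:
        v_prime = new("v'")
        arcs.add((v, v_prime))
        primes.append(v_prime)

    # Step (4): split the arc pointing to the leaf labelled n with w_n.
    leaf_n = next(v for v, k in label.items() if k == n)
    (w,) = parents(leaf_n)
    w_n = new("w")
    arcs.remove((w, leaf_n))
    arcs.update({(w, w_n), (w_n, leaf_n)})

    # Steps (5) and (6): hybrid nodes a and b below all v' and w_n, each
    # with one new leaf child, labelled n + 1 and n + 2 respectively.
    a, b = new("a"), new("b")
    for x in primes + [w_n]:
        arcs.update({(x, a), (x, b)})
    for h, k in ((a, n + 1), (b, n + 2)):
        leaf = new("leaf")
        arcs.add((h, leaf))
        label[leaf] = k
    return nodes, arcs, label
```

Applied to a small S-rDAG that is not a phylogenetic network (say, one with an elementary node and a hybrid node with two leaf children), the output can be inspected to confirm that the hybrid nodes have out-degree 1 and that no tree node is elementary.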

Let us prove now that N̄ is a tree-sibling time consistent phylogenetic
network.

• It is rooted (with the same root as N) and acyclic, because all new arcs
are either used to split arcs in N into pairs of consecutive arcs, or to define
paths that end in the new leaves n + 1 or n + 2 without forming cycles.

• It has no elementary nodes. Indeed, the internal tree nodes in N get an
extra child in step (3), and the tree nodes that are added to N either get
an extra child in step (3) or they get two children in (5).

• Its hybrid nodes have only one child, and it is a tree node: this is ensured
for the hybrid nodes in N in step (1), and for the new hybrid nodes a and
b by construction.

• It is tree-sibling. All hybrid nodes in N get a tree sibling in steps (2) and
(3) (for every hybrid node h in N, if e is any arc pointing to h, then the
tree child v'_e of the new node v_e added in the middle of e is such a tree
sibling of h), and the hybrid nodes a and b have the tree sibling n.

• It is time consistent. To check this, we use Proposition 1 (and the notations
introduced therein). Since we already know that N̄ is acyclic, any cycle
in N̄* must contain the inverse of some hybridization arc. There are two
possibilities for this inverse. If it has the form (h, x), with h one of the
new hybrid nodes a or b introduced in step (5) and x one of the tree nodes
v' introduced in step (3) or the tree node w_n introduced in step (4), then
the only tree arcs that can be reached from x in N̄* are those pointing to
the leaves n, n + 1 or n + 2, and therefore no cycle in N̄* contains this arc
(h, x) together with a tree arc. And if this inverse is of the form (h, v_e),
with h a hybrid node in N and v_e one of the tree nodes introduced in step
(2), then it must be followed in the cycle by the arc (v_e, v'_e) added in step
(3), and, as we have just said, the only tree arcs that can be reached from
v'_e point to a leaf, and hence no cycle in N̄* contains this arc (h, v_e) and
a tree arc, either.

It is clear that the construction of N̄ from N adds O(|V| + |E|) nodes and
arcs to N, and thus it is polynomial in the size of N.

Now, the S-rDAG N can be easily recovered from N̄ by simply undoing
its construction:

(5) Remove the leaves n + 1 and n + 2 and their hybrid parents a and b, together
with all arcs pointing to them.

(4) Remove the elementary parent of the leaf n (which will be the remaining
leaf with largest label in S) and replace it by an arc from the parent of
the removed node to n.

(3) Remove all non-labeled leaves of the resulting rDAG together with the
arcs pointing to them.

(2) Remove each parent v_e of every hybrid node, and replace it by an arc from
the parent of v_e to the hybrid child of v_e.

(1) Remove the only tree child of each hybrid node, and replace it by an arc
from the hybrid node to each one of the children of the removed node.

(0) The resulting S-rDAG is N. 

It is straightforward to check now that, for every pair of S-rDAGs N_1 and N_2,
N_1 ≅ N_2 if, and only if, N̄_1 ≅ N̄_2 as phylogenetic networks over
S ∪ {n + 1, n + 2}.

We cannot remove the condition |S| ≥ 3 in the previous result because there
are only two TSTC phylogenetic networks with fewer than 3 leaves (up to the
actual names of the labels). In particular, this implies that, in the proof of the
previous result, we cannot add fewer than 2 new leaves in the construction of N̄
from N.

Proposition 3 There is only one TSTC phylogenetic network on {1} and only
one TSTC phylogenetic network on {1,2}, and in both cases they are trees.

Proof. The {1}-rDAG consisting of a single node, labeled 1, and the
{1,2}-rDAG consisting of the phylogenetic tree with Newick code (1,2); are clearly
TSTC phylogenetic networks. Let us check now that any other TSTC phyloge- 
netic network has at least 3 leaves. 

Let N = (V, E) be a TSTC phylogenetic network other than those described
in the last paragraph, let τ : V → ℕ be a temporal assignment, and let v be
an internal node with largest τ-value and, among those with this largest time
assignment, of largest depth.

If v is a tree node, then all its children are either leaves or hybrid nodes with
leaf children (because any tree descendant node of v has time assignment larger
than τ(v)). And v's hybrid children would have the same time assignment as v,
but depth larger than v's depth, against the assumption. Therefore all children
of v are leaves, and it has at least 2 children, because it cannot be elementary.
Now, if v has more than 2 children, we are done, while if it has only two children,
say the leaves 1 and 2, then v will have a parent in N (because N is not the
tree (1,2);). If the parent of v is a tree node, let w be this node, and let z
be another child of w. Since N does not contain cycles, and any path to 1 or
2 must contain w, we deduce that any descendant leaf of z must be different
from 1 and 2: this gives at least 3 leaves. If, on the contrary, the parent of v is a
hybrid node x, let w be a parent of x that has a tree child, say z. The time
consistency prevents x from being a descendant of z (because τ(z) > τ(w) = τ(x))
and therefore, since any path leading to 1 or 2 must contain x, any leaf that is
a descendant of z will be different from 1 and 2: this gives again at least 3 leaves.

If v is a hybrid node, then its child is a leaf, say 1. Let v_1 be a parent of
v that has a tree child. Since τ(v_1) = τ(v) is the largest τ-value of an internal
node of N, this tree child must be a leaf, say 2. Now let v_2 be another parent
of v. Since it is a tree node, it must have another child other than v, say x.
If x is a tree node, it is a leaf, as we have just seen. If x is hybrid, then since
τ(x) = τ(v_2) = τ(v), the tree child of x must be a leaf. In both cases, we obtain
a leaf that is different from 1 and 2, that is, N contains at least 3 leaves.

4 Conclusion 

We have proved that, unless the graph isomorphism problem belongs to P, there
is no hope of defining a polynomially computable metric on the class of all TSTC
phylogenetic networks on a set S of at least 3 taxa. The problem of defining
polynomially computable, and biologically sound, metrics on the class
of all TSTC phylogenetic networks on a given set S with all their hybrid nodes
of in-degree bounded by some d ∈ ℕ remains open. When d = 2, the μ-distance is such
a metric [2], but it is no longer a metric for d = 4 (see the Supplementary
Material to the aforementioned paper). Actually, we do not even know whether
the isomorphism problem for TSTC phylogenetic networks on a given set S of
taxa with globally bounded in-degree hybrid nodes (but without bounding the
out-degree of the tree nodes; otherwise, Luks' theorem [8] would apply) is in P,
but we conjecture that this is the case.